Improving Biomedical Document Retrieval by Mining Domain Knowledge

Authors

  • Shuguang Wang
  • Milos Hauskrecht
Abstract

When research articles introduce new findings or concepts, they typically relate them only to knowledge and domain concepts of immediate relevance. As a result, many domain concepts relevant to the article and its findings are omitted from the text, which may prevent us from retrieving articles of interest when executing a search query. Approaches such as probabilistic latent semantic indexing (PLSI) overcome this limitation by projecting terms in articles to a lower-dimensional latent space and identifying the best possible matches in this space. Nevertheless, this approach may not perform well enough if the number of explicit knowledge concepts in the articles is too small compared to the amount of knowledge in the domain. The objective of this paper is to address the problem by exploiting a domain knowledge layer: a rich network of associations among knowledge concepts in the domain of interest. We present a new document retrieval framework that i) extracts associations among knowledge concepts from many documents in the literature corpus and ii) exploits them to improve the retrieval of relevant documents. We test our approach on the problem of retrieval of biomedical documents and show that it outperforms standard Lucene and BM25 information-retrieval methods.
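
The two-step idea in the abstract (mine concept associations from the corpus, then use them at query time) can be illustrated with a minimal Python sketch. This is not the authors' method: it assumes associations are plain document-level co-occurrence counts and that retrieval is a simple weighted concept overlap. The function names `mine_associations`, `expand_query`, and `rank`, the `top_k` and `weight` parameters, and the toy corpus are all hypothetical.

```python
from collections import Counter
from itertools import combinations

# Toy corpus (hypothetical): document id -> concepts mentioned explicitly.
TOY_DOCS = {
    "d1": ["BRCA1", "breast cancer", "DNA repair"],
    "d2": ["BRCA1", "DNA repair"],
    "d3": ["DNA repair", "RAD51"],
    "d4": ["insulin", "diabetes"],
}


def mine_associations(docs):
    """Count how often each pair of concepts appears in the same document.

    Assumption: co-occurrence counts stand in for the richer association
    network described in the abstract.
    """
    assoc = {}
    for concepts in docs.values():
        for a, b in combinations(sorted(set(concepts)), 2):
            assoc.setdefault(a, Counter())[b] += 1
            assoc.setdefault(b, Counter())[a] += 1
    return assoc


def expand_query(query, assoc, top_k=2, weight=0.5):
    """Keep query concepts at weight 1.0 and add their strongest associates.

    Assumption: associates get a fixed lower weight; a real system would
    likely derive the weight from the association strength.
    """
    weights = {concept: 1.0 for concept in query}
    for concept in query:
        for neighbor, _count in assoc.get(concept, Counter()).most_common(top_k):
            weights.setdefault(neighbor, weight)
    return weights


def rank(docs, weights):
    """Order documents by the summed weights of the concepts they mention."""
    scores = {
        doc_id: sum(weights.get(c, 0.0) for c in set(concepts))
        for doc_id, concepts in docs.items()
    }
    # Highest score first; ties broken by document id for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))
```

In this toy setting, a query for `BRCA1` also gives partial credit to `d3`, which never mentions `BRCA1` but discusses an associated concept. That is the kind of omitted-concept match the abstract is concerned with.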

Similar resources

Document Clustering Based on Ontology and a Fuzzy Approach

Data mining, also known as knowledge discovery in databases, is the process of discovering unknown knowledge from a large amount of data. Text mining applies data mining techniques to extract knowledge from unstructured text. Text clustering, the unsupervised classification of similar documents into different groups, is one of the important techniques of text mining. The most important step...

A Web-Mining Approach to Disambiguate Biomedical Acronym Expansions

Named Entities Recognition (NER) has become one of the major issues in Information Retrieval (IR), knowledge extraction, and document classification. This paper addresses a particular case of NER, acronym expansion (or definition) when this expansion does not exist in the document using the acronym. Since acronyms may obviously expand into several distinct sets of words, this paper provides nin...

Improving Keyphrase Extraction from Biomedical Documents Using Domain Specific Feature Set

Keyphrases enable the reader to quickly determine whether the given article is suitable for the reader’s digest. Keyphrases are also important for medical document retrieval and text mining research. Sometimes, the author-assigned Keyphrases or keywords available with the articles are too limited to represent the topical content of the articles. Many medical documents also do not come with auth...

Assessment of approximate string matching in a biomedical text retrieval problem

Text-based search is widely used for biomedical data mining and knowledge discovery. Character errors in the literature affect the accuracy of data mining, and methods for solving this problem are being explored. This work tests the usefulness of the Smith-Waterman algorithm with affine gap penalty as a method for biomedical literature retrieval. Names of medicinal herbs collected from herbal medicine...

Text Mining in Biomedical Domain with Emphasis on Document Clustering

OBJECTIVES: With the exponential increase in the number of articles published every year in the biomedical domain, there is a need to build automated systems to extract unknown information from the articles published. Text mining techniques enable the extraction of unknown knowledge from unstructured documents. METHODS: This paper reviews text mining processes in detail and the software tools a...

Publication date: 2009